Data Analysis with R and Git

Exploring and Analyzing Satellite Gridded data with R and Git Code Tracking

Arlette Simo Fotso (Researcher, INED) &
Mariam Bougma (PhD student, INED)

Workshop Objectives:

This workshop aims to:

  1. Equip researchers with practical skills to download, explore, and analyze gridded and satellite data in R

  2. Instruct participants in best practices for replicability and open science, with Git and GitHub integrated into R

Agenda of the day

  • Short Introduction to Spatial Objects and gridded data
  • Git introduction
  • Lunch break
  • Git Hands-on session
  • R & Git Hands-on session: From gridded data to data frame

Short Introduction to Spatial Objects and Gridded data

Geospatial Perspective

  • Geospatial analysis concerns what happens where, and makes use of geographic information that links features and phenomena on the Earth’s surface to their locations.

  • There are a few different concepts when it comes to spatial information:

    • Place: A parcel of land, neighborhood, city, country, state, or nation-state.

      • In geospatial analysis, the basis of a rigorous and precise definition of place is a coordinate system.
    • Attributes: The term ‘attributes’ usually refers to records in a data table associated with individual elements in a spatial data file. Each row includes elements that allow us to place it on a map.

    • Objects: In spatial analysis, it is customary to refer to places as objects.

Remote Sensed and gridded data

  • Much of the observation data for entire classes of variables—or for regions with sparse data coverage—comes from satellites.

  • In their raw form, these remotely sensed data are often complex to work with.

  • Fortunately, these data are often pre-processed

  • Opening a window of opportunity for social scientists to use them (see Kugler et al.)

  • Derived products—usually combined with other sources of data—provide almost-ready-to-use gridded representations of interest to researchers working on human–environment interactions.

    • for example, the CHIRPS rainfall dataset uses the TRMM satellite dataset as an input.

Satellite Data Processing Levels

  • There is a set of terminology that NASA uses to refer to the levels of processing it conducts:

    • Level 0 & 1 is the raw instrument data that may be time referenced. It is the most difficult to use.

    • Level 2 is Level 1 data that has been converted into a geophysical quantity through a computer algorithm (known as retrieval). This data is geo referenced and calibrated.

    • Level 3 is Level 2 data that has been mapped on a uniform space time grid and quality controlled.

    • Level 4 is Level 3 data that has been combined with models or other instrument data.

Warning

  • This processing is one of the sources of differences between data products, and the same applies to climate model output.
  • Precipitation is a special beast. It is spatiotemporally highly heterogeneous (it can rain a lot in one place, and not rain at all on the other side of the hill, or an hour or a minute later) and difficult to measure accurately.
    • Make sure to read documentation/studies evaluating your chosen data product

Rasters

  • Gridded data is usually stored in a raster file format.

  • A raster is a spatially explicit matrix or grid where each cell represents a geographic location. Each cell corresponds to a pixel on a surface.

  • Raster data is commonly used to represent spatially continuous phenomena such as elevation, temperature, air quality, etc.

  • The size of each pixel defines the resolution of the raster.

  • The smaller the pixel size, the finer the spatial resolution.

  • The extent or spatial coverage of a raster is defined by the minimum and maximum values for both x and y coordinates.
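
These properties can be sketched with a toy raster built with the terra package (the grid and values here are invented for illustration):

```r
library(terra)

# A 4 x 4 grid covering longitudes 0-2 and latitudes 0-2 (in degrees)
r <- rast(nrows = 4, ncols = 4,
          xmin = 0, xmax = 2, ymin = 0, ymax = 2,
          vals = 1:16)

res(r)  # pixel size: 0.5 x 0.5 degrees -> the spatial resolution
ext(r)  # extent: the min/max x and y coordinates of the grid
```

Halving the pixel size (e.g. an 8 x 8 grid over the same extent) would make the spatial resolution twice as fine.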

Other types of spatial objects: vector data

  • Spatial objects are usually represented by vector data.
  • Such data consists of a description of the “geometry” or “shape” of the objects, and normally also includes additional variables.
  • For example, a vector data set may represent the borders of the countries of the world (geometry), and also store their names.
  • The main vector data types:
    • points: each has one coordinate pair, and n associated variables.
    • lines: ordered sets of coordinates.
    • polygons: sets of closed poly-lines.
    • network: a special type of lines geometry carrying additional information about things like flow, connectivity, direction, and distance.
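
As a minimal sketch (with invented coordinates), these geometry types can be built directly with the sf package:

```r
library(sf)

pt <- st_point(c(-17.45, 14.69))                       # a point: one coordinate pair
ln <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))  # a line: ordered coordinates
pg <- st_polygon(list(rbind(c(0, 0), c(2, 0),
                            c(2, 2), c(0, 2), c(0, 0))))  # a closed poly-line

# An sf object combines geometry with associated variables (attributes)
capital <- st_sf(name = "Dakar", geometry = st_sfc(pt, crs = 4326))
```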

Coordinate Reference Systems (CRS)

  • A very important aspect of spatial data is the coordinate reference system (CRS) that is used.
  • A geographic coordinate system (GCS) is a reference framework that defines the locations of features on a model of the earth.
    • Its units are angular, usually in degrees.
    • The datum is the part of the GCS that determines the model used to calculate these angles.
  • A projected coordinate system (PCS) is flat.
    • It contains a GCS, but it converts that GCS into a flat surface.
    • Its units are linear, most commonly in meters.
    • A projection is a mathematical method for representing the curved surface of the earth on a flat map.

CRS More Details

  • The commonly used datum: World Geodetic System 1984 (WGS 84).
    • Other GCS: International Terrestrial Reference Frame (ITRF), Geodetic Reference System 1980 (GRS 80).
  • One commonly used projection is Universal Transverse Mercator (UTM) CRS: good for distance calculation.
    • Other CRS are: “Mercator,” “UTM,” “Robinson,” “Lambert,” “Sinusoidal,” and “Albers.”
  • Most commonly used CRSs have been assigned an EPSG code (EPSG stands for European Petroleum Survey Group).
  • EPSG codes can be looked up online.
  • Vector data can be transformed from lon/lat coordinates to planar and back without loss of precision. This is not the case with raster data.
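
The lossless round trip for vector data can be checked with sf, using the WGS 84 (EPSG:4326) and UTM zone 28N (EPSG:32628) codes discussed in this section (the point itself is invented):

```r
library(sf)

# A lon/lat point roughly in Senegal, in WGS 84
p <- st_sfc(st_point(c(-17.45, 14.69)), crs = 4326)

# Project to UTM zone 28N (planar, metres), then back to lon/lat
p_utm  <- st_transform(p, 32628)
p_back <- st_transform(p_utm, 4326)

st_distance(p, p_back)  # essentially zero: no precision lost
```

Reprojecting a raster, by contrast, requires resampling the cell values onto a new grid, which is why it cannot be undone exactly.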

Introduction to Git and GitHub/Gitlab

What is Git and GitHub/Gitlab

  • Git is a version control software created in 2005
    • It is installed on your computer, and you can use it in Rstudio
    • It is used to easily deal with different versions of text files (code, LaTeX files, txt documents, etc.)
    • It is a free and open-source software
  • Github is a website that uses Git, and acts as a central hub for your Git project
    • It allows you to store your code online, share it with others, and collaborate on it
    • It is a proprietary developer platform that is free to use
    • It is not the only platform that uses Git, but it is the most popular one (Gitlab is another one)

Why use Git and Github?

  • To keep track of the different versions of your work
  • To collaborate with others (especially on code)
  • For open science: to share your code and access code shared by others.
    • If you want to publish the code from your paper's analysis and have it reproduced,
    • or contribute to code that someone else has published
  • This is even easier thanks to the integration of version control into RStudio

How do Git and Github work?

  • A Git repository is equivalent to your project folder
    • Each collaborator can modify the files in this repository by creating successive commits
    • In our case, this will usually be done in RStudio
  • Github acts as the central repository storing all the files and their commit history (remote repository or origin)
    • All collaborators have a copy of this remote repository locally, and must synchronise with it using pushes and pulls

Some vocabulary

  • Repository: a folder whose versions are tracked by Git
  • Remote repository: the folder on the remote platform (here Github)
  • Local repository: the folder on your computer
  • Origin: another way to refer to the remote repository
  • Commit: changes saved inside Git's version history
  • Pull: updates the local repository with new remote commits
  • Push: updates the remote repository with new local commits
  • Conflict: when the remote and local repositories have two different versions of the same part of the code
  • Git vs Github: the software that keeps track of versions vs. the website that allows you to store online and share this version control with others

Good practice:

  • Make sure you always pull before you push!
  • Communicate with your collaborators about the parts you are working on, so that you are not modifying the same file at the same time
    • If you modify the same lines in the same file, you can have a merge conflict!
    • But don’t worry if this happens, you can resolve the merge conflict by choosing one by one the version you want to keep
    • Or work with branches (https://happygitwithr.com/git-branches) to limit merge conflicts (more relevant for developers, we will not cover it here)

General steps to use Git and Github with Rstudio

  1. Link Rstudio with your Github account (you will only have to do it once for each computer)
  2. Create a repository on your Github
  3. Clone it on your computer through Rstudio
  4. Add documents, code, etc… and start working on them on your computer!
  5. Commit and push on Github or pull modification from the remote repository
  6. Fork a repository
    Apart from step 1, you can do these steps in any order, depending on your needs.
    We will do this during the hands-on session.

Git and GitHub Hands-on Session

1.2 From Rstudio Create SSH (Secure Shell) Key

  • In RStudio, go to Tools > Global Options > Git/SVN
  • Click on Create SSH Key to generate a new SSH key
  • In the window that appears, hit the Create button (no passphrase needed) and close the window
  • Click View public key
  • Copy the generated key to your clipboard

1.3 From Github paste the copied SSH Key

  • Log in to your GitHub account
  • Go to Settings, then to SSH and GPG keys
  • Click on New SSH key, paste the key from Rstudio, and give it a title
  • Click Add SSH key to save it
  • You are set! You will not have to repeat these steps for all your future projects on this computer.

1.4 From RStudio Configure Git via the Terminal

  • In RStudio, open the terminal (bottom left panel)

  • Set your commit email address in Git by typing:

    • git config --global user.email "YOUR_EMAIL"
  • Confirm that you have set the email address correctly in Git

    • git config --global user.email
  • Set a Git username by typing:

    • git config --global user.name "Your name"
  • Confirm that you have set the Git username correctly

    • git config --global user.name
  • Be careful to use straight quotation marks!

2. Create a repository on your Github

  • On GitHub, click on the + icon in the top right corner and select New repository
  • Give your repository a name (e.g., test_yourname)
  • Optionally, add a description
  • Choose whether to make it public or private
  • Check the box to initialize the repository with a README file
  • Click Create repository
  • You can also fork an existing repository if you want to work on it (we will see this later)

3. Clone it on your computer through RStudio

3.1 From Github copy the SSH URL

  • Go to the "test_yourname" repository
  • Click on the green Code button
  • Then click on the SSH tab
  • Copy the SSH URL to your clipboard

3.2 From Rstudio paste the copied SSH URL

  • In RStudio, go to File > New Project > Version Control > Git
  • Paste the URL of your repository in the Repository URL field
  • Choose a directory where you want to clone the repository
  • Click Create Project
  • This will create a new RStudio project linked to your GitHub repository

4. Add documents, code, etc… and start working on them on your computer!

  • You can now add files to your project folder (e.g., R scripts, data files, etc.)
  • For example, in RStudio, you can create new files by going to File > New File > R Script
  • Save the script as "test_script" in your project folder

5. Commit and push on Github or pull modifications

5.1 Commit your changes

  • In RStudio, go to the Git tab in the top right panel
  • You will see the files you have modified or added
  • Check the box next to the files you want to commit
  • Click on Commit pending changes
  • Write a commit message in the Commit message field (e.g., “Initial commit”)
  • Click Commit to save your changes locally

5.2 Push your changes to GitHub

  • After committing, click on the Push button in the Git tab to upload your changes to GitHub

5.3 Pull changes from GitHub

  • If you want to update your local repository with changes made by others on GitHub, click on the Pull button in the Git tab
  • This will fetch the latest changes from the remote repository and merge them into your local repository

5.4 Resolving conflicts

  • If you and another collaborator have made changes to the same file, you may encounter a merge conflict when trying to pull changes.
  • RStudio will show you the conflicting files, and conflict markers <<<<<<<, =======, >>>>>>>
  • The lines between the lines beginning <<<<<<< and =======
    • are what you already had locally - you can tell because "HEAD" is written next to it.
  • The lines between the lines beginning ======= and >>>>>>>
    • are what you are trying to pull from the remote repository.
  • You can resolve the conflicts by choosing which changes to keep.
  • After resolving conflicts, you can commit the changes and push them to GitHub as usual.

6. Fork a repository

  • If you want to work on an existing repository, you can fork it to create your own copy.
  • To do this, go to the repository on GitHub:
  • And click on the Fork button in the top right corner.
  • This will create a copy of the repository in your GitHub account.
  • You can then clone this forked repository to your local machine and work on it as described above.

Gridded data Hands-on Session

Plan of the hands-on session

  • Necessary packages
  • Mapping vector data
  • Mapping demographic data
  • Mapping and manipulating gridded data
  • Comparing demographic and environmental indicators

R Main Necessary Tools to Manipulate Spatial Data

To work with rasters and vectors in R, we need some key packages:

  • sf: Support for simple features, a standardized way to encode spatial vector data.
  • stars, terra, or raster to handle raster data.
  • terra replaces the raster package. The interfaces of terra and raster are similar, but terra is simpler, faster, and can do more.
  • ggspatial: Spatial Data Framework for ggplot2 to map data.
  • Here we will also use tidyverse for data manipulation.

Get Ready to Start

  • Go to the IPC_pre_conference_workshop repository on GitHub
  • Download the project folder as a zip file (or fork it to your own GitHub account and clone it)
  • Unzip the folder IPC_pre_conference_workshop.
  • Double-click on the file IPC_pre_conference_workshop.Rproj included in this project folder.
    • This will open RStudio in a new project environment.
  • Open Analysis.qmd and begin following along with these code chunks.
  • Note: Quarto / Rmarkdown work best if the path to your directory contains no spaces
  • Install any necessary packages if not already installed using this command:
# Installs any necessary packages if not already installed 
# sometimes you need to install Rtools first (https://cran.r-project.org/bin/windows/Rtools/rtools44/rtools.html)
for(
  pkg in c(
    "srvyr", "survey", "tidyverse", "sf", "terra", "ggspatial", "stars", "stringi", "exactextractr", "haven", "spdep", "geodata", "tmap" #, "rasters", "gtools", "srvyr", "survey", "lme4", "broom.mixed", "broom", "remotes"
  )
){
  if(!require(pkg, quietly = TRUE, character.only = TRUE)){
    install.packages(pkg)
  }
}

Working with vector data

Example with DHS GPS data of Senegal

  • Open the file in R
senegal19gps <- st_read("data/SNG_gps/SNGE8BFL.shp")
class(senegal19gps$geometry)
  • Check the CRS of the data
st_crs(senegal19gps)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

Quick plotting with plot

plot(senegal19gps$geometry)

More Enhanced Plot with ggplot

ggplot() +
  layer_spatial(senegal19gps, fill = NA) +
  theme_minimal() # You can also use theme_void() for a blank canvas

Add north arrow and bar scale

ggplot() +
  layer_spatial(senegal19gps, fill = NA) +
  theme_void()  +
  labs(
    title = "Fig- Map of DHS clusters"
  ) +
  annotation_scale(
    location = "br",
    pad_x = unit(3, "cm")
  ) +
  annotation_north_arrow(
    location = "tr",
    pad_x = unit(0.3, "in"),
    style = north_arrow_fancy_orienteering
  )

Source for Countries’ Administrative Borders (Shapefile)

There are many sources where you can download Shapefile data for countries, such as:

Opening the Shapefile/Basemap of Senegal

  • Using the downloaded files and the sf package to open:
senegal0 <- st_read(  here::here("data", "shapefile_SEN", "gadm41_SEN_0.shp") )
class(senegal0$geometry)
  • Check the CRS of the data
st_crs(senegal0)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
  • The EPSG is 4326

Transforming CRS of the spatial object

  • Load a test shapefile of Senegal
senegal_test <- st_read( here::here("data", "shapefile_SEN_test", "gadm41_SEN_3.shp"))
st_crs(senegal_test)
Coordinate Reference System:
  User input: WGS 84 / UTM zone 28N 
  wkt:
PROJCRS["WGS 84 / UTM zone 28N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 28N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-15,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    ID["EPSG",32628]]
  • The EPSG is 32628

Transforming the senegal_test CRS into senegal0 CRS

senegal_test_project <- st_transform(senegal_test, st_crs(senegal0))
# Alternatively, you can use the EPSG code directly as follows:
# senegal_test_project <- st_transform(senegal_test, crs = 4326)
st_crs(senegal_test_project)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

More efficient way of getting shapefiles: the geodata package

#download the shapefile of Senegal
poly.adm0 <- geodata::gadm(country="Senegal", level=0, path=tempdir())
poly.adm1 <- geodata::gadm(country="Senegal", level=1, path=tempdir())
poly.adm2 <- geodata::gadm(country="Senegal", level=2, path=tempdir())
poly.adm3 <- geodata::gadm(country="Senegal", level=3, path=tempdir())

#read them with the sf package
adm0 <- sf::st_as_sf(poly.adm0)
adm1 <- sf::st_as_sf(poly.adm1)
adm2 <- sf::st_as_sf(poly.adm2)
adm3 <- sf::st_as_sf(poly.adm3)
#str(adm2)

Plot the country borders in R

ggplot() +
    layer_spatial(adm0, fill = NA ) +
theme_minimal() # theme_void()

Plot the country’s admin1 borders

ggplot() +
    layer_spatial(adm1, fill = NA,  color = "blue") +
theme_minimal() # theme_void()

Plot the country’s admin2 borders

ggplot() +
    layer_spatial(adm2, fill = NA, color = "red" ) +
theme_minimal() # theme_void()

Plot both DHS clusters and country’s admin1 borders

ggplot() +
    layer_spatial(adm1, fill = NA  ) +
    layer_spatial(senegal19gps,fill = NA) +
theme_minimal() # theme_void()

Working with Individual Level Data: Example with DHS Individual Dataset

Preparing the Data

  • We created a Moderately or Severely Wasted indicator: children aged 0-59 months whose weight-for-height z-score is below -2.0 standard deviations (SD) from the median on the WHO Child Growth Standards (i.e., hc72 < -200).

  • First, we load the demo data.

load(here::here("data", "demodata",  "demodata.RData"))
  • We then weight the data, taking into account the DHS complex sample design.
demodatawt <- demodata %>% as_survey_design(ids = psu, strata = strata, weights = wt, nest = TRUE)
  • Create the table for the proportion by admin1 subdivision.
demodata_prev_admin1 <- demodatawt  %>% 
  srvyr::group_by(region) %>% 
  srvyr::summarize(
    waist_prev = survey_mean(waisted)
  )
  • Merge with the basemap.
demodata_prev_admin1_shp = left_join( adm1%>% mutate(CC_1=as.double(CC_1)), demodata_prev_admin1 %>% mutate(CC_1=as.double(region)), by = join_by(CC_1))
class(demodata_prev_admin1_shp)
[1] "sf"         "data.frame"

Plotting wasting prevalence at the country's admin 1 level

 # ggplot() +
 #  geom_sf(data = demodata_prev_admin1_shp, aes(fill = waist_prev)) +
 #  scale_fill_gradient2(name = "Wasting prevalence", low = "#4375B7", high = "#D72B22", na.value = "transparent") +
 # theme_minimal() # theme_void()
 
 ggplot() + 
  layer_spatial(demodata_prev_admin1_shp, aes(fill = waist_prev)) +
  scale_fill_viridis_c(name = "Wasting prevalence", na.value = NA) +
  theme_minimal()

Plotting at Lower Administrative Level

  • Plotting at lower administrative levels (e.g., Admin 2) is more challenging because DHS randomly displaces the clusters’ GPS latitude/longitude positions for confidentiality.

  • However, the displacement is restricted so that the points stay within the country’s Admin2 area.

  • Admin2 names or polygons are not included in the GPS data, so we need to spatially join the DHS GPS data with the Admin2 base-map from GADM.

Spatially Join the Two Datasets with st_join to Get Admin2 Names

senegal19gps_adm2 <- st_join(senegal19gps, adm2)
# we check that it worked well
anti_join(senegal19gps_adm2 |> st_drop_geometry(), adm2, by = "NAME_2") # 0 # To check the gps points that are not placed in any entity from the basemap
anti_join(adm2, senegal19gps_adm2 |> st_drop_geometry(), by = "NAME_2") # 0 # To check the map entities that don't have any gps inside them

Preparing the data

  • Then we merge the individual DHS data with the DHS GPS dataset, which has admin 2 names
demodata_2 <- left_join(demodata, senegal19gps_adm2, by = join_by(cluster_number == DHSCLUST))
anti_join(demodata, senegal19gps, by = join_by(cluster_number == DHSCLUST)) # This should have 0 rows if all rows are matched in the left_join
  • Then we weight the data, taking into account the complex sample design
demodatawt <- demodata_2 %>% as_survey_design(ids = psu, strata = strata, weights = wt, nest = TRUE)
  • Create the table for the proportion by admin 2
demodata_prev_admin2 <- demodatawt  %>% 
  srvyr::group_by(NAME_2) %>% 
  srvyr::summarize(
    waist_prev = survey_mean(waisted)
    
  )
  • Join the prevalence dataframe with the country's admin2 borders
demodata_prev_admin2_shp <- left_join(adm2, demodata_prev_admin2, by = join_by(NAME_2 == NAME_2)) 
# Check that all names match: anti_join(adm2, demodata_prev_admin2, by = join_by(NAME_2 == NAME_2))

Then plotting wasting prevalence at admin 2 level

 wasting_plot2 = ggplot() + 
  layer_spatial(demodata_prev_admin2_shp, aes(fill = waist_prev)) +
  scale_fill_viridis_c(name = "Wasting prevalence", na.value = NA) +
  theme_minimal()

wasting_plot2
  • Save it
ggsave(here::here("results",  "prev_waist.png"), width = 15, height = 10)

Zoom in on a Specific Region (Dakar)

# Filter the data for the Dakar region (base R)
demodata_prev_admin2_shp_dkr = demodata_prev_admin2_shp[demodata_prev_admin2_shp$NAME_1 == "Dakar", ]
# or equivalently with dplyr
demodata_prev_admin2_shp_dkr = demodata_prev_admin2_shp %>% filter(NAME_1 == "Dakar")
#plot it
ggplot() + 
  layer_spatial(demodata_prev_admin2_shp_dkr, aes(fill = waist_prev)) +
  scale_fill_viridis_c(name = "Wasting prevalence", na.value = NA) +
  theme_minimal()

or

# fix the map limits to show the Dakar region only
ggplot() + 
  layer_spatial(demodata_prev_admin2_shp, aes(fill = waist_prev)) +
  scale_fill_viridis_c(name = "Wasting prevalence", na.value = NA) +
  coord_sf(xlim = c(-17.5, -17.1), ylim = c( 14.5, 14.9)) +
  theme_minimal()

Working with raster/gridded data

Some Sources for Environmental Gridded Data

Downloading example for CHIRPS data

  • Go to https://climateserv.servirglobal.net/map
  • Select or draw your area
  • Type of request: raw data
  • Dataset type: observation
  • Data source: UCSB CHIRPS rainfall, select period
  • Format: tif
  • Range: 01/01/2015 - 31/12/2020
  • Submit query (it takes a few minutes)

Raster files with terra package

  • The stars, raster, and terra packages allow you to read raster data.

  • The terra package has a single object class for raster data, SpatRaster.

  • A SpatRaster represents a spatially referenced surface divided into three-dimensional cells (rows, columns, and layers).

  • When a SpatRaster is created from a file, it does not load the cell (pixel) values into memory (RAM).

  • It only reads the parameters that describe the geometry of the SpatRaster, such as the number of rows and columns and the coordinate reference system. The actual values are only read when needed.
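
This lazy behaviour can be observed directly (a sketch, writing a small invented raster to a temporary file first):

```r
library(terra)

# Write a tiny raster to disk, then read it back
f <- file.path(tempdir(), "toy.tif")
writeRaster(rast(nrows = 10, ncols = 10, vals = 1:100), f, overwrite = TRUE)

r <- rast(f)
inMemory(r)      # FALSE: only the geometry/metadata have been read

v <- values(r)   # the cell values are only read from disk at this point
```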

Opening a single layer raster file

We open the file 20200508.tif located in the chirps_tif2 folder with the rast() command

prec_20200508 <- rast(here::here("data", "chirps_tif2",  "20200508.tif"))

#prec_20230508 <- rast("data/chirps_tif2/20200508.tif") # works as well
class(prec_20200508)
[1] "SpatRaster"
attr(,"package")
[1] "terra"

Then we check the object created by displaying it

prec_20200508
class       : SpatRaster 
dimensions  : 108, 173, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : 20200508.tif 
name        : 20200508 

This output summarizes the .tif file for 8 May 2020 (20200508 in YYYYMMDD format). Note that there are:

  • 108 rows of pixels
  • 173 columns of pixels
  • 1 layer named 20200508

Plotting the raster with ggplot/ggspatial

ggplot() +
    layer_spatial(  prec_20200508 ) + 
  labs(fill = "Daily rainfall in mm (2020/05/08)") +
theme_minimal()

Adding Senegal basemap

  • First, check the CRS
st_crs(adm2)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(prec_20200508)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        MEMBER["World Geodetic System 1984 (G2296)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

Then the plot

ggplot() +
    layer_spatial(
    prec_20200508
  ) +
    layer_spatial(
    adm2,
    fill = NA,
color= "red"
  ) +
  theme_minimal()

Cropping a raster

  • Before cropping, it’s always a good idea to check the CRS of both the raster and the spatial object

  • It will not work if they don’t match.

  • The command below gives an error message

#prec_20230508_crop  <- crop(prec_20200508, senegal_test)
  • The one below works because the two objects have the same CRS
prec_20200508_crop  <- crop(prec_20200508, adm0)
prec_20200508_crop
class       : SpatRaster 
dimensions  : 88, 124, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -17.55, -11.35, 12.3, 16.7  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
varname     : 20200508 
name        : 20200508 
min value   : 0.000000 
max value   : 1.799023 
prec_20200508
class       : SpatRaster 
dimensions  : 108, 173, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : 20200508.tif 
name        : 20200508 

Saving raster data

  • Sometimes you may want to save cropped or otherwise modified raster data
terra::writeRaster(prec_20200508_crop, here::here("data", "modified_data", "prec_prec_20200508_crop.tif"), filetype = "GTiff" ,  overwrite=TRUE)

Extracting raster values for a polygon

  • The exactextractr package provides a fast and accurate algorithm for summarizing values in the portion of a raster dataset that is covered by a polygon.
  • Unlike other zonal statistics implementations, it takes into account raster cells that are partially covered by the polygon.
prec_by_districts <- exactextractr::exact_extract(
  prec_20200508, 
  adm2, 
  fun = "mean"
) %>% 
  as.data.frame() %>%
  dplyr::rename_with(~"chirps") #%>%
  # bind_cols(senegal, .)
  • exact_extract() returns its results in the same order as the rows of the polygon data, so each value can be matched back to its polygon row by row.

Summary of the values

summary(prec_by_districts)
     chirps        
 Min.   :0.000000  
 1st Qu.:0.000000  
 Median :0.000000  
 Mean   :0.001005  
 3rd Qu.:0.000000  
 Max.   :0.044226  

Working with multilayer raster

You can stack multiple raster layers of the same spatial resolution and extent to create a RasterStack using raster::stack() or a RasterBrick using raster::brick().

But rast() from the terra package is more powerful.

Often, processing one multi-layer object has computational advantages over processing multiple single-layer objects one by one.

# the list of paths to the files
files_list <- c("data/chirps_tif2/20200131.tif", "data/chirps_tif2/20200201.tif")

# read the two files at the same time
multi_layerraster <- rast(files_list)
multi_layerraster
class       : SpatRaster 
dimensions  : 108, 173, 2  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
sources     : 20200131.tif  
              20200201.tif  
names       : 20200131, 20200201 

Working with all files in a single folder as a list

#first import all files in a single folder as a list 
rastlist <- list.files(path = "data/chirps_tif2", pattern = "\\.tif$", all.files = TRUE, full.names = TRUE)
#--- read them all at the same time ---#
allprec <- terra::rast(rastlist)
allprec
class       : SpatRaster 
dimensions  : 108, 173, 2192  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
sources     : 20150101.tif  
              20150102.tif  
              20150103.tif  
              ... and 2189 more sources
names       : 20150101, 20150102, 20150103, 20150104, 20150105, 20150106, ... 

Or, better, create one multi-layer raster per year

# first we create a year list
years <- map(2015:2020, ~{
  list.files("data/chirps_tif2/", pattern = paste0("^", .x), full.names = TRUE)
})

# name the list elements after their year
years <- set_names(years, 2015:2020) 


#create a multi-layer raster per year and store all years in one large list

years <- years %>% map(~.x %>% rast)
years$`2019`
class       : SpatRaster 
dimensions  : 108, 173, 365  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
sources     : 20190101.tif  
              20190102.tif  
              20190103.tif  
              ... and 362 more sources
names       : 20190101, 20190102, 20190103, 20190104, 20190105, 20190106, ... 

Yearly rainfall accumulation

# for a single year
years$`2019` %>% sum()
class       : SpatRaster 
dimensions  : 108, 173, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :        sum 
min value   :   60.60399 
max value   : 2173.61415 
# More efficiently, we apply the same sum() function to every year in our list
chirps_yearly_sum <- map(years, ~.x %>% sum)

Creating a raster of 1 layer per year

chirps_yearly_sum <- rast(chirps_yearly_sum)
chirps_yearly_sum
class       : SpatRaster 
dimensions  : 108, 173, 6  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :       2015,      2016,       2017,       2018,       2019,      2020 
min values  :   78.81139,  110.7043,   65.90624,   96.83039,   60.60399,  102.1226 
max values  : 2611.19193, 1902.9028, 1901.80054, 2018.84273, 2173.61415, 2656.6543 

For every pixel (0.05 degrees lat by 0.05 degree lon), we now have the total rainfall accumulation for every year 2015-2020.

Let us plot it for 2019

precip_plot = ggplot() + 
  layer_spatial(mask(chirps_yearly_sum$`2019`, vect(adm2), touches = FALSE)) + 
  layer_spatial(adm2, alpha = 0) +
  theme_minimal() + 
  scale_fill_gradient2(low = "#D72B22", high = "#4375B7", na.value = "transparent") + 
  labs(fill = "Total precipitation (2019)")
precip_plot

Calculations with raster data

  • Example with the z-score of the yearly rainfall accumulation

First we calculate the average yearly rainfall accumulation for each pixel (across all years):

chirps_avg <- mean(chirps_yearly_sum)

The standard deviation from that average:

chirps_sd <- stdev(chirps_yearly_sum)

Finally we can use both to compute a Z-score for each pixel in each year.

chirps_z <- (chirps_yearly_sum - chirps_avg) / chirps_sd
chirps_z
class       : SpatRaster 
dimensions  : 108, 173, 6  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -19.1, -10.45, 12.05, 17.45  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :       2015,      2016,      2017,      2018,      2019,      2020 
min values  : -0.9935492, -1.511942, -1.718446, -1.563885, -2.087076, -1.319390 
max values  :  2.2231631,  1.690862,  1.262613,  1.024049,  1.660715,  2.188421 
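The raster algebra above is plain per-cell arithmetic. As a sanity check, the same z-score computation on a small base-R matrix (toy numbers; rows play the role of pixels, columns of years) behaves as expected: every pixel's series ends up with mean 0 and standard deviation 1. Note that base R's sd() is the sample standard deviation, whereas terra::stdev() defaults to the population version, so the two can differ slightly.

```r
# toy data: 3 "pixels" (rows) x 4 "years" (columns) of yearly totals
x <- matrix(c(100, 120,  90, 110,
              500, 450, 550, 520,
              300, 310, 290, 305),
            nrow = 3, byrow = TRUE)

avg <- apply(x, 1, mean)  # per-pixel mean across years
s   <- apply(x, 1, sd)    # per-pixel (sample) standard deviation

# z-score of each year relative to the pixel's own history
z <- (x - avg) / s

rowMeans(z)      # ~0 for every pixel
apply(z, 1, sd)  # 1 for every pixel
```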

We can now extract the values by district, or for any other area of interest for our study

We extract mean values by district, rename the columns, and bind them to the Senegal district boundaries

tot_yearly_prec_by_dep <- exactextractr::exact_extract(
  chirps_yearly_sum, 
  adm2, 
  fun = "mean"
) %>% 
  dplyr::rename_with(
    ~ifelse(
      stringr::str_detect(.x, "\\."), 
      paste0("chirps", .x),
      .x
    )
  ) %>%
   bind_cols(adm2, .)
#summary(chirps_yearly_sum)
class(tot_yearly_prec_by_dep)

Plot the mean total precipitation at district level in 2019

precip_plot2 = ggplot() + 
  layer_spatial(tot_yearly_prec_by_dep, aes(fill = `chirpsmean.2019`)) +
  theme_minimal() + 
  scale_fill_gradient2(low = "#D72B22", high = "#4375B7", na.value = "transparent") + 
  labs(fill = "Mean total precipitation (2019)")
precip_plot2

Comparing precipitation and child wasting prevalence

ggarrange(wasting_plot2, precip_plot2,
          common.legend = FALSE,  legend = "bottom")

Interactive maps with tmap

tmap_mode("view")

senegal <- tm_shape(demodata_prev_admin2_shp) +
  tm_borders() +
  tm_polygons(col = "waist_prev", palette = "Greens", values.range = .9, id = "NAME_2", title = "Wasting prevalence") +
  # tm_compass() + tm_scale_bar() + tm_layout(legend.outside = TRUE) +
  # tmap_options(check.and.fix = TRUE) +
  tm_scale_bar(position = c("left", "bottom"))

senegal

THANK YOU FOR YOUR ATTENTION

Contact

Arlette Simo Fotso (Researcher, INED) : arlette.simo-fotso@ined.fr

To go further: spatial autocorrelation

Moran’s I

  • Though our visual senses can, in some cases, discern clustered regions from non-clustered regions, the distinction may not always be so obvious
  • One popular measure of spatial autocorrelation is the Moran’s I coefficient
  • Read more about it here
  • Define neighboring polygons
    • We must first define what is meant by “neighboring” polygons
    • contiguous neighbors, distance-based neighbors (which allow for annulus neighbors), or k-nearest neighbors
  • We then need to assign weights to each neighboring polygon
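Before turning to spdep, the coefficient itself is simple to compute by hand. Here is a base-R sketch on a toy strip of four polygons (values and neighbor structure invented for illustration), using the usual definition I = (n / S0) * Σᵢⱼ wᵢⱼ zᵢ zⱼ / Σᵢ zᵢ²:

```r
# toy example: 4 polygons in a row; each is neighbor to the adjacent one(s)
x <- c(1, 2, 3, 4)
n <- length(x)

# binary contiguity matrix, then row-standardised (spdep's style = "W")
W <- matrix(0, n, n)
W[cbind(c(1, 2, 2, 3, 3, 4),
        c(2, 1, 3, 2, 4, 3))] <- 1
W <- W / rowSums(W)

z  <- x - mean(x)   # deviations from the mean
S0 <- sum(W)        # sum of all weights

# Moran's I = (n / S0) * sum_ij w_ij * z_i * z_j / sum_i z_i^2
I <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)
I  # 0.4: values increase smoothly along the strip, so positive autocorrelation
```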

Let’s do it with the spdep package

  • We’ll adopt a contiguity-based neighbor definition, accepting any polygon that shares at least one vertex (the “queen” criterion)
nb <- poly2nb(demodata_prev_admin2_shp, queen=TRUE)
  • In our case, each neighboring polygon is given the weight 1/(number of neighbors), so that the weights of each polygon’s neighbors sum to 1
lw <- nb2listw(nb, style="W", zero.policy=FALSE) #Setting zero.policy to FALSE will return an error if at least one polygon has no neighbor
  • Computing the Moran’s I coefficient
moran(demodata_prev_admin2_shp$waist_prev, listw = lw, n = length(nb), S0 = Szero(lw))
$I
[1] 0.4280556

$K
[1] 3.877164

Assessing statistical significance

  • Monte Carlo approach
MC<- moran.mc(demodata_prev_admin2_shp$waist_prev, lw, nsim = 999)
MC$p.value
[1] 0.001
  • It is also possible to compute the Local Moran’s I, but it requires a sufficiently large dataset

  • More here